作業目的: Data Visualization (02) Text

這份作業希望能夠讓你熟悉中文文字處理,並執行基本的文字相關分析,再將結果以圖表呈現。過程中會運用到過去幾週影片中的 document-level, word-level text analysis, regular expression, and text mining.

這次的作業使用維基文庫提供的歷任中華民國總統就職演說。因為總統就職演說本身代表了每一屆總統任期的,以其重要性,因此國內外媒體時常使用演說的內文當作素材,利用文字探勘的技巧寫出報導,以 2020 年為例,大家可以參考中央社的蔡總統關心什麼 文字會說話 以及 readr 的 少了「年輕人」多了「防疫」:臺灣歷屆民選總統就職演說字詞分析。國外的則可以參考 “I Have The Best Words.” Here’s How Trump’s First SOTU Compares To All The Others. by BuzzFeed, Word Aanalysis of 2016 Presidential debates - Clinton vs. Trump by Martin Krzywinski, and Trump used words like ‘invasion’ and ‘killer’ to discuss immigrants at rallies 500 times: USA TODAY analysis by USA today.

小小的反思:直接用資料、直接用斷詞結果(台灣 vs. 臺灣)可能會出錯喔!

作業: Data Visualization (02) Text

### 這邊不要動
library(tidyverse)
library(jiebaR)
library(tidytext)

df_speech <- read_csv("data/AS06/df_speech.csv")
### 給你看資料長這樣
df_speech %>% glimpse()
#> Rows: 15
#> Columns: 6
#> $ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
#> $ term      <chr> "一", "二", "三", "四", "五", "六", "七", "八", "九", "十", "十一", "十…
#> $ year      <dbl> 1948, 1954, 1960, 1966, 1972, 1978, 1984, 1990, 1996, 2000…
#> $ president <chr> "蔣中正", "蔣中正", "蔣中正", "蔣中正", "蔣中正", "蔣經國", "蔣經國", "李登輝", "李…
#> $ title     <chr> "中華民國第一任總統就職演說總統 蔣中正1948年5月20日\n", "中華民國第二任總統就職演說總統 蔣中正195…
#> $ text      <chr> "  中正承國民大會依照憲法選舉為中華民國總統,擔任國家和人民的公僕,當此就職伊始,追念我 國父和先烈締造民國的艱難…

0. 斷詞:

請利用 library(jiebaR) 斷詞,過程中也要保留詞性的欄位。

### your code

cutter <- worker()
tagger <- worker("tag")
stopWords <- readRDS("data/AS06/stopWords.rds")
no_word <- c("掌聲","一個","政府","未來","我們","使","今日","乃是","亦","再","唯","公","人","此一","公","新","過去","四年","中","持續","今天","之上","年","一百三十","萬","支柱","一塊","我會","事情","更好","相同","邁進","之間","兩千","經由","現場","最","應該","我要","前","一起")
segment_not <- c("台灣","臺灣","馬英九","李登輝","蔡英文","蔣經國")

new_user_word(cutter, segment_not)
new_user_word(tagger, segment_not)

unnested.df <- df_speech%>%
  mutate(word = purrr::map(text, function(x)segment(x, tagger))) %>%
  select(id,president,year,word) %>%
  mutate(word = purrr::map(word, function(x)str_c(names(x),"_",x))) %>%
  unnest(word) %>% 
  separate(word,c("pos","word"),sep = "_" ) %>%
  filter(word != " ") %>%
  filter(!(word %in% stopWords$word)) %>%
  filter(!(word %in% no_word)) %>%
  filter(!str_detect(word, "[a-zA-Z0-9]+")) %>%
  mutate(word = str_replace(word,"臺灣","台灣"))
#> [1] TRUE
#> [1] TRUE

1. 整體熱門詞彙:

請先找出所有總統演說當中出現次數最高的 10 個詞彙,接著計算每屆總統演說時,這些詞彙出現的次數,最後將結果畫成圖表。因為斷詞結果會因為預先載入的詞典有所不同,所以底下的示意圖參考即可,請盡量呈現 有意義的 詞彙!

### your code

unnested.df %>%
  count(word,sort = T) %>%
  slice(1:10)

ten_word <- c("台灣", "國家", "民主", "發展", "社會", "國際", "經濟", "自由", "和平","中華民國")

unnested.df %>%
  filter(word %in% c("台灣", "國家", "民主", "發展", "社會", "國際", "經濟", "自由", "和平", "中華民國")) %>%
  count(year,word) %>%
  select("次數" = n , year, word) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(x = year , y = word , fill = 次數)) +
  geom_tile() +
  scale_fill_gradient(low = "white",high = "red")+
  ylab("詞彙")+
  xlab("年份")+
  theme_bw()+
  theme(text=element_text(family="黑體-繁 中黑", size=10),
        #legend.title=element_blank(),
        legend.position = "top",
        panel.border = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank())+
  labs(title = "歷屆總統就職演說熱門詞彙")

#scale_x_discrete(limits=c("1948","1954","1960","1966","1972","1978","1984","1990","1996","2000","2004","2008","2012","2016","2020"))+
            
### your result should be
# 自己畫就好唷
#> # A tibble: 10 x 2
#>    word         n
#>    <chr>    <int>
#>  1 台灣       289
#>  2 國家       193
#>  3 民主       175
#>  4 發展       129
#>  5 社會       127
#>  6 國際       109
#>  7 經濟       109
#>  8 自由        83
#>  9 和平        81
#> 10 中華民國    81

2. 各自熱門詞彙:

請先找出各個總統演說中,出現次數最高的 10 個詞彙,並且將結果畫成圖表。因為斷詞結果會因為預先載入的詞典有所不同,所以底下的示意圖參考即可,請盡量呈現 有意義的 詞彙!

### your code

unnested.df %>%
  group_by(president) %>%
  count(word,sort = T) %>%
  slice(1:10) %>%
  mutate(word = fct_reorder(word,n)) %>%
  ggplot(aes(x = word, y = n))+
  geom_col()+
  coord_flip() + 
  facet_wrap(~president, scales = "free_y")+
  ylab("詞彙")+
  xlab("次數")+
  theme_bw()+
  theme(text=element_text(family="黑體-繁 中黑", size=10),
        legend.title=element_blank(),
        legend.position = "top",
        panel.border = element_blank(),
        panel.grid.minor = element_blank())+
  labs(title = "歷屆總統就職演說熱門詞彙")

### your result should be
# 自己畫就好唷

3. TF-IDF:

請先篩掉各個總統演說中出現次數小於 5 的詞彙,接著計算 TF-IDF (不知道這是什麼的話請看老師影片!),最後將結果畫成圖表。因為斷詞結果會因為預先載入的詞典有所不同,所以底下的示意圖參考即可,請盡量呈現 有意義的 詞彙!

### your code


unnested.df %>%
  group_by(word) %>%
  filter(n() > 5) %>%
  ungroup() %>%
  count(president, word) %>% 
  bind_tf_idf(word, president, n) %>%
  group_by(president) %>%
  arrange(-tf_idf) %>%
  slice(1:10) %>%
  mutate(word = fct_reorder(word,n)) %>%
  ungroup() %>%
  ggplot(aes(x = tf , y = word))+
  geom_col() +
  facet_wrap(~president, scales = "free_y")+
  ylab("詞彙")+
  xlab("tf_idf")+
  theme_bw()+
  theme(text=element_text(family="黑體-繁 中黑", size=10),
        legend.title=element_blank(),
        legend.position = "top",
        panel.border = element_blank(),
        panel.grid.minor = element_blank())+
  labs(title = "歷屆總統演說_TFIDF")

### your result should be
# 自己畫就好唷

4. 捉對廝殺:

請先留下蔡英文和馬英九的用詞,接著計算兩者用詞數量差異最大各自前十名的詞彙,最後將結果畫成圖表。因為斷詞結果會因為預先載入的詞典有所不同,所以底下的示意圖參考即可,請盡量呈現 有意義的 詞彙!

### your code

unnested.df %>%
  filter(president == c("蔡英文","馬英九")) %>%
  count(word,president) %>%
  spread(president, n, fill = 0) %>%
  mutate(diff = 蔡英文-馬英九) %>%
  arrange(-diff) %>%
  group_by(diff > 0) %>%
  top_n(10, abs(diff)) %>%
  ungroup() %>%
  mutate(word = reorder(word, diff)) %>%
  ggplot() + aes(word, diff, fill =ifelse (diff > 0, "蔡英文","馬英九")) + 
  scale_fill_manual(values =c("#4F9D9D","#005AB5"))+
  scale_y_continuous(limits = c(-20,20),breaks = seq(-20,20,10))+
  geom_col() + 
  coord_flip() + 
  xlab("詞彙")+
  ylab("使用次數差異")+
  theme_bw()+
  theme(text=element_text(family="黑體-繁 中黑", size=10),
        legend.title=element_blank(),
        legend.position = "top",
        panel.border = element_blank(),
        panel.grid.minor = element_blank())+
  labs(title = "馬英九與蔡英文使用差異最大詞彙")

### your result should be
# 自己畫就好唷